1. Problem description (with error log context):
Traceback (most recent call last):
File "tools/train_net.py", line 195, in <module>
args=(args,),
File "/opt/ModelZoo-PyTorch-master/PyTorch/contrib/cv/detection/Cascade_RCNN/detectron2/engine/launch.py", line 82, in launch
main_func(*args)
File "tools/train_net.py", line 183, in main
return trainer.train()
File "/opt/ModelZoo-PyTorch-master/PyTorch/contrib/cv/detection/Cascade_RCNN/detectron2/engine/defaults.py", line 422, in train
super().train(self.start_iter, self.max_iter)
File "/opt/ModelZoo-PyTorch-master/PyTorch/contrib/cv/detection/Cascade_RCNN/detectron2/engine/train_loop.py", line 162, in train
self.run_step()
File "/opt/ModelZoo-PyTorch-master/PyTorch/contrib/cv/detection/Cascade_RCNN/detectron2/engine/train_loop.py", line 262, in run_step
scaled_loss.backward()
File "/usr/local/python3.7.5/lib/python3.7/contextlib.py", line 119, in __exit__
next(self.gen)
File "/usr/local/python3.7.5/lib/python3.7/site-packages/apex/amp/handle.py", line 142, in scale_loss
optimizer._post_amp_backward(loss_scaler)
File "/usr/local/python3.7.5/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 397, in post_backward_with_master_weights
self._amp_combined_init()
File "/usr/local/python3.7.5/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
return func(*args, **kwargs)
File "/usr/local/python3.7.5/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 661, in combined_init_with_master_weights
stash.main_fp16_grad_combine, stash.fp16_grad_list = get_grad_combined_tensor_from_param(stash.all_fp16_params)
File "/usr/local/python3.7.5/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 34, in get_grad_combined_tensor_from_param
original_combined_tensor = combine_npu(list_of_grad)
File "/usr/local/python3.7.5/lib/python3.7/site-packages/apex/contrib/combine_tensors/combine_tensors.py", line 27, in combine_npu
combined_tensor = torch.zeros(total_numel, dtype=dtype).npu()
RuntimeError: ACL stream synchronize failed, error code:507015
THPModule_npu_shutdown success
2. Software versions:
-- CANN version: 6.0.RC1
-- Firmware/driver version: 5.1.rc2
-- PyTorch version: torch1.5-20211229
-- Python version: Python 3.7.5
-- OS version: Ubuntu 20.04.1 LTS
-- Architecture: x86
-- Model script: Cascade_RCNN
3. Hardware version:
Huawei A800-9010 910B
4. Logs:
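5. Steps tried:
Switched to a different set of versions:
-- CANN version: 5.0.3
-- Driver version: 21.0.3.2
-- Firmware version: 1.79.22.7.220
-- PyTorch version: torch1.5-20211229
The error is the same as before: RuntimeError: ACL stream synchronize failed, error code:507015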
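To check that both version combinations really fail the same way, the ACL error lines from the two runs can be compared programmatically. This is only a minimal sketch: `extract_acl_error_codes` is a hypothetical helper written for illustration and is not part of CANN, apex, or PyTorch. The two sample strings are just the error line quoted in this report.

```python
import re

# Matches lines such as:
#   RuntimeError: ACL stream synchronize failed, error code:507015
ACL_ERROR_RE = re.compile(r"ACL (?P<what>.+?) failed, error code:\s*(?P<code>\d+)")


def extract_acl_error_codes(log_text):
    """Hypothetical helper: return (operation, code) pairs for each ACL failure line."""
    return [
        (match.group("what"), int(match.group("code")))
        for match in ACL_ERROR_RE.finditer(log_text)
    ]


# Error line from the original run (CANN 6.0.RC1) and from the retry (CANN 5.0.3).
first_run = "RuntimeError: ACL stream synchronize failed, error code:507015\n"
second_run = "RuntimeError: ACL stream synchronize failed, error code:507015\n"

# True when both runs report the same failing operation and error code.
same_failure = extract_acl_error_codes(first_run) == extract_acl_error_codes(second_run)
print(same_failure)
```

In practice the full log files of each run would be read and passed in instead of the inline strings.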